
SQL: OOM#584

Open
kbatuigas wants to merge 5 commits into rp-sql from DOC-2000-document-feature-oom-prevention

Conversation

@kbatuigas
Contributor

@kbatuigas kbatuigas commented May 14, 2026

Description

This pull request adds a new troubleshooting guide focused on handling memory-related query cancellations in Redpanda SQL. The page explains the automatic out-of-memory (OOM) protection mechanism, describes the client-facing error, and gives actionable steps for users to recover from or prevent repeated cancellations. It also provides guidance on monitoring memory usage and includes several TODOs for subject matter expert (SME) validation.

Key additions:

Troubleshooting documentation:

  • Added a new page, memory-management.adoc, that explains how Redpanda SQL cancels queries when a node approaches its memory limit and how users can recover from or prevent these cancellations.
  • Provided actionable recommendations for users experiencing repeated OOM cancellations, including reducing query concurrency, simplifying queries, or scaling up the cluster.
  • Documented how to monitor node memory usage using the oxla_process_memory_total Prometheus metric.
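As a rough illustration of the monitoring item above, the sketch below reads oxla_process_memory_total samples from Prometheus text exposition output and flags nodes that are close to a memory limit. Only the metric name comes from this PR. The label names, the assumption that the value is in bytes, the 8 GB limit, and the 0.9 threshold are all hypothetical and would need SME confirmation.

```python
def parse_samples(text, name):
    """Collect {label_string: value} for every sample of metric `name`
    found in Prometheus text exposition output."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            base, rest = line.split("{", 1)
            labels, rest = rest.rsplit("}", 1)
        else:
            base, _, rest = line.partition(" ")
            labels = ""
        if base.strip() != name:
            continue
        # The first token after the labels is the value; an optional
        # timestamp may follow and is ignored.
        samples[labels] = float(rest.split()[0])
    return samples


def nodes_over_threshold(samples, limit_bytes, threshold=0.9):
    """Return the label strings whose usage is at or above
    threshold * limit_bytes (assumes the metric is reported in bytes)."""
    return sorted(
        labels for labels, used in samples.items()
        if used / limit_bytes >= threshold
    )


# Hypothetical scrape output; the label names are made up for this example.
SCRAPE = """\
# TYPE oxla_process_memory_total gauge
oxla_process_memory_total{node="sql-0"} 7.6e+09
oxla_process_memory_total{node="sql-1"} 4.0e+09
some_other_metric 1
"""

samples = parse_samples(SCRAPE, "oxla_process_memory_total")
hot_nodes = nodes_over_threshold(samples, limit_bytes=8e9)
print(hot_nodes)
```

In practice an alerting rule in Prometheus itself would likely be a better home for this check; the script only shows the comparison being made.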

Guidance for further validation:

  • Included several SME-directed TODOs to confirm error message details, recommended runbook steps, and configuration options for memory limits.

Resolves https://github.com/redpanda-data/documentation-private/issues/
Review deadline: 21 May

Page previews

Redpanda SQL > Troubleshoot > nav: OOM Cancellations / page title: Troubleshoot Memory-related Query Cancellations

Checks

  • New feature
  • Content gap
  • Support Follow-up
  • Small fix (typos, links, copyedits, etc)

@netlify

netlify Bot commented May 14, 2026

Deploy Preview for rp-cloud ready!

Name Link
🔨 Latest commit f63c3df
🔍 Latest deploy log https://app.netlify.com/projects/rp-cloud/deploys/6a0d0638e88d8200084537a6
😎 Deploy Preview https://deploy-preview-584--rp-cloud.netlify.app

@coderabbitai
Contributor

coderabbitai Bot commented May 14, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f5ff4fca-6b4d-4ba1-bc22-f8a52905197d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@kbatuigas kbatuigas force-pushed the DOC-2000-document-feature-oom-prevention branch from fc9e91c to 7e0ff93 May 19, 2026 03:30
@kbatuigas kbatuigas changed the title from "Start OOM doc draft" to "SQL: OOM" May 20, 2026
@kbatuigas kbatuigas marked this pull request as ready for review May 20, 2026 00:23
@kbatuigas kbatuigas requested a review from a team as a code owner May 20, 2026 00:23
@kbatuigas kbatuigas requested a review from mattschumpert May 20, 2026 04:02

[source,text]
----
cancelled due to OOM prevention
----

@Greketrotny Greketrotny May 20, 2026

The "cancelled due to OOM prevention" error is a sibling of the primary user-facing error, Query Out of Memory.
Query Out of Memory is reported when a particular query exhausts all memory resources and has to be cancelled. This is normal behavior: the engine counts allocated memory so that it does not enter an unexpected state or a deadlock. For this error, the advice is to retry the query, or to cancel or wait for other concurrently running tasks to finish before retrying. I feel like this page is describing this case, but with the wrong error message.
The thing is, the engine doesn't track all allocations, so it doesn't have full control over allocated memory. This is where the "cancelled due to OOM prevention" error comes in.

The OOM prevention mechanism is an overseer. It addresses this gap by monitoring overall memory usage in an external, independent way. It's more of an emergency handler that quickly frees reclaimable resources to remain operational. However, triggering it means either that the untracked pool grew unexpectedly or that there is a serious problem with memory tracking, and it should probably result in a bug report from the client, with access to the logs, nearly every time. This, I suspect, is more of a runbook/customer support scenario.

I don't know exactly what should be visible in the public documentation, but I feel like this page blends two problems, so there are two parts to describe and discuss. The first should definitely be visible to the user, with an explanation of why it happens. The second (the emergency one) is more of an issue/emergency scenario. Maybe it should be in the docs too, but on a different page.
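To make the distinction above concrete, here is a minimal client-side sketch that treats the two errors differently: Query Out of Memory is retried with a backoff, while "cancelled due to OOM prevention" is re-raised immediately so it can be escalated with logs. The matched strings are taken from this thread, but their exact wording and casing on the wire are an assumption, and classify, run_with_retry, and the run_query callable are hypothetical names rather than part of any Redpanda SQL client.

```python
import time

# Substrings matched case-insensitively; exact server wording is assumed.
QUERY_OOM = "query out of memory"
OOM_PREVENTION = "cancelled due to oom prevention"


def classify(message):
    """Map an error message to a suggested client action."""
    text = message.lower()
    if OOM_PREVENTION in text:
        return "escalate"  # emergency handler fired: collect logs, file a bug
    if QUERY_OOM in text:
        return "retry"     # tracked limit hit: retry once load has dropped
    return "other"


def run_with_retry(run_query, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call run_query(), retrying only on Query Out of Memory.

    Waits base_delay * 2**attempt seconds between tries so that
    concurrently running queries have a chance to finish.
    """
    for attempt in range(attempts):
        try:
            return run_query()
        except Exception as exc:
            last_try = attempt == attempts - 1
            if classify(str(exc)) != "retry" or last_try:
                raise
            sleep(base_delay * 2 ** attempt)
```

A caller would wrap its own query function, for example `run_with_retry(lambda: cursor.execute(sql))` with whatever driver is in use. Lowering concurrency before the retry is left to the caller.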



2 participants